Layout language models were inspired by the BERT model, where the input text is represented by text embeddings and position embeddings.
LayoutLM further adds two types of input embeddings: 2-D position embeddings that encode each token's bounding box on the page, and image embeddings for the corresponding image regions.
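The 2-D position embeddings are built from word bounding boxes whose coordinates are normalized to a 0-1000 grid before being embedded. A minimal sketch of that normalization (the function name is illustrative, not from the transformers library):

```python
def normalize_box(box, width, height):
    """Scale pixel coordinates (x0, y0, x1, y1) to LayoutLM's 0-1000 grid."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# e.g. a word box on a 1417x1986-pixel page
print(normalize_box((100, 200, 300, 400), 1417, 1986))
```

Because every page is mapped to the same 0-1000 grid, the embeddings are independent of the original image resolution.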
LayoutLM is the first model where text and layout are jointly learned in a single framework for document-level pre-training.
LayoutLM is a simple but effective multi-modal pre-training method of text, layout, and image for visually-rich document understanding and information extraction tasks, such as form understanding and receipt understanding.
There are several versions of LayoutLM (LayoutLM, LayoutLMv2, and LayoutLMv3), and each improved on the previous state-of-the-art (SOTA) results on multiple datasets.
We can use the LayoutLM models from the transformers Python package.
We can use the LayoutLMv3FeatureExtractor class to extract features from documents. The returned features consist of three components: pixel_values, words, and boxes.
Under the hood, the feature extractor uses the Tesseract OCR engine, and the model itself runs on TensorFlow/PyTorch.
Let's check the architecture of the LayoutLM model.
Let's try to run the model.
Install the following requirements.
!pip3 install transformers
!pip3 install pytesseract
!pip install Pillow==9.0.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
|████████████████████████████████| 5.8 MB 13.6 MB/s
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.8/dist-packages (from transformers) (1.21.6)
Collecting huggingface-hub<1.0,>=0.10.0
Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
|████████████████████████████████| 182 kB 16.2 MB/s
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/dist-packages (from transformers) (21.3)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
|████████████████████████████████| 7.6 MB 9.7 MB/s
Requirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from transformers) (3.8.0)
Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from transformers) (2.23.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.8/dist-packages (from transformers) (2022.6.2)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.8/dist-packages (from transformers) (4.64.1)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.8/dist-packages (from transformers) (6.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.8/dist-packages (from huggingface-hub<1.0,>=0.10.0->transformers) (4.4.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging>=20.0->transformers) (3.0.9)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (2022.9.24)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (1.24.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->transformers) (2.10)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytesseract
Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Collecting Pillow>=8.0.0
Downloading Pillow-9.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
|████████████████████████████████| 3.2 MB 15.1 MB/s
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.8/dist-packages (from pytesseract) (21.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging>=21.3->pytesseract) (3.0.9)
Installing collected packages: Pillow, pytesseract
Attempting uninstall: Pillow
Found existing installation: Pillow 7.1.2
Uninstalling Pillow-7.1.2:
Successfully uninstalled Pillow-7.1.2
Successfully installed Pillow-9.3.0 pytesseract-0.3.10
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Pillow==9.0.0
Downloading Pillow-9.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB)
|████████████████████████████████| 4.3 MB 13.5 MB/s
Installing collected packages: Pillow
Attempting uninstall: Pillow
Found existing installation: Pillow 9.3.0
Uninstalling Pillow-9.3.0:
Successfully uninstalled Pillow-9.3.0
Successfully installed Pillow-9.0.0
Install the Tesseract engine on the system.
!apt -qq install tesseract-ocr > /dev/null
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
The following command is needed if we want to use the Tesseract engine for Telugu-language predictions:
# !apt -qq install tesseract-ocr-tel > /dev/null
Import the necessary modules.
The most important one is transformers, which provides the LayoutLM model classes.
from pathlib import Path
from PIL import Image, ImageDraw
import numpy as np
from transformers import LayoutLMv3FeatureExtractor, LayoutLMv3TokenizerFast, LayoutLMv3Processor
I have stored my files in my Google Drive, so I need to mount the Drive in the Colab notebook.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
ls "/content/drive/MyDrive/IUB Fall 22/Advance NLP/Project/data"
others/ resume_images/ wantok_images/
file_path = "/content/drive/MyDrive/IUB Fall 22/Advance NLP/Project/data"
With Google Drive mounted, we can read the files.
Let's load one image from wantok_images.
images_path = list(Path(file_path + "/wantok_images").glob("*"))
image = Image.open(images_path[0])
Let's check the size of the image.
image.size
(1417, 1986)
Let's view the loaded image.
image
Now that we have our image, let's extract its features using the LayoutLMv3 feature extractor.
feature_extractor = LayoutLMv3FeatureExtractor(apply_ocr = True, ocr_lang='eng')
feature_extractor
LayoutLMv3ImageProcessor {
"apply_ocr": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.5,
0.5,
0.5
],
"image_processor_type": "LayoutLMv3ImageProcessor",
"image_std": [
0.5,
0.5,
0.5
],
"ocr_lang": "eng",
"resample": 2,
"rescale_factor": 0.00392156862745098,
"size": {
"height": 224,
"width": 224
},
"tesseract_config": ""
}
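Two of the values above are easy to decode: rescale_factor is exactly 1/255 (raw 0-255 pixel values are scaled into 0-1 before normalization with image_mean and image_std of 0.5), and resample=2 is PIL's bilinear filter. A quick arithmetic check:

```python
# Rescale then normalize, as the image processor does per channel.
rescale_factor = 1 / 255              # == 0.00392156862745098 from the config
pixel = 128                           # a raw 8-bit pixel value
rescaled = pixel * rescale_factor     # roughly 0.502
normalized = (rescaled - 0.5) / 0.5   # image_mean = 0.5, image_std = 0.5
print(rescaled, normalized)
```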
features = feature_extractor(image)
features.keys()
dict_keys(['pixel_values', 'words', 'boxes'])
print(f"Words: {features['words'][0]}")
print(f"Boxes: {features['boxes'][0]}")
print(f"Image pixels: {features['pixel_values'][0].shape}")
Words: ['Long', 'yia', '1968', 'dispela', 'Visiting', 'Misin', 'bilong', 'Yunaitet', 'Nesens', 'i', 'kamap', 'long', 'Niugini.', 'Hia', 'ol', 'i', 'stap', 'long', 'Rabaul', 'long', 'taim', 'bilong', 'ilek-', 'sen.', 'Nem', 'bilong', 'ol', 'em', 'hia', '(kirap', 'long', 'lephan):', 'Mista', 'A.', 'V.', 'Caine', '(Liberia),', 'Mista', 'J.M.', 'McEwan', '(Nu', 'Silan),', 'Mista', 'W.', 'Allen', '(Amerika),', 'na', 'Mista', 'P.', 'Gaschignard', '(Frans).', '(', 'D.I.E.S.poto', ')', 'Trinde,', 'Mas', '3,', '1971', '©1', 'memba', 'bilong', 'lain', 'bilong', 'Yunaitet', 'Nesens', 'i', 'kam', 'lukluk', 'raun', 'long', 'Teritori', 'long', 'yia', '1965.', 'Em', 'hia', 'ol', 'i', 'stap', 'long', 'Kompian.', '01', 'memba', 'i', 'sindaun', 'i', 'stap', '(ki-', 'rap', 'long', 'lephan):', 'Mista', 'Dermot', 'J.', 'Andre', 'Naudy', '(Frans),', 'Dwight', 'Dickinson', '(Amerika),', 'Nathaniel', 'Eastman', '(Liberia).', '(Yunaitet', 'Nesens', 'poto)', 'Swan', '(Englan),', '01', 'memba', 'bilong', '1971', 'Visiting', 'Misin', 'bilong', 'Yu-', 'naitet', 'Nesens', 'i', 'bung', 'insait', 'long', 'bus', 'long', 'Isten', 'Hailans.', 'Em', 'ol', 'hia', 'lephan}:', 'Sir', 'Denis', 'Allen', '(Englan),', 'Mista', 'Paul', 'Blanc', '(Frans),', 'na', 'Mista', 'Adnam', 'Raouf', '(Irak)', '(', 'D.I.E.S.', 'poto', ')', 'wanpela', 'haus', '(kirap', 'long']
Boxes: [[74, 786, 108, 795], [119, 786, 145, 795], [156, 786, 190, 795], [202, 787, 263, 795], [283, 787, 354, 796], [55, 798, 100, 805], [119, 798, 172, 807], [191, 798, 263, 806], [282, 799, 336, 806], [348, 799, 354, 806], [54, 810, 100, 818], [110, 810, 145, 818], [155, 810, 225, 819], [237, 810, 263, 818], [274, 811, 290, 818], [302, 810, 308, 818], [321, 811, 354, 819], [55, 821, 90, 830], [100, 822, 153, 829], [165, 822, 199, 831], [211, 822, 245, 829], [251, 822, 304, 831], [311, 823, 354, 830], [55, 835, 88, 840], [110, 833, 136, 840], [146, 834, 199, 843], [211, 834, 226, 841], [237, 836, 254, 841], [264, 834, 290, 842], [302, 833, 354, 843], [55, 845, 90, 854], [101, 844, 170, 854], [191, 846, 236, 854], [254, 846, 270, 854], [282, 846, 297, 854], [310, 846, 354, 854], [56, 855, 143, 867], [164, 858, 208, 866], [219, 858, 251, 866], [264, 858, 318, 866], [330, 857, 354, 866], [55, 868, 115, 879], [127, 870, 172, 877], [182, 870, 197, 877], [209, 870, 254, 877], [266, 869, 352, 880], [54, 883, 71, 888], [81, 881, 127, 889], [137, 882, 151, 889], [164, 882, 263, 891], [275, 881, 342, 890], [148, 892, 152, 901], [163, 894, 272, 903], [284, 893, 288, 902], [212, 148, 270, 158], [282, 149, 307, 155], [319, 149, 332, 156], [347, 149, 378, 155], [95, 455, 111, 462], [122, 456, 166, 463], [176, 456, 230, 465], [242, 456, 276, 463], [286, 456, 338, 465], [350, 457, 420, 464], [431, 457, 485, 464], [496, 458, 501, 464], [512, 458, 539, 465], [78, 467, 130, 474], [140, 470, 176, 475], [187, 468, 220, 476], [231, 468, 302, 475], [314, 468, 347, 477], [359, 468, 384, 477], [396, 468, 437, 477], [459, 469, 476, 476], [486, 469, 511, 476], [522, 470, 537, 476], [77, 479, 83, 486], [95, 479, 130, 488], [141, 479, 175, 488], [184, 480, 255, 488], [268, 480, 283, 487], [294, 480, 339, 487], [350, 480, 356, 487], [369, 481, 430, 488], [441, 481, 447, 488], [459, 481, 494, 490], [505, 479, 538, 489], [76, 493, 103, 500], [114, 491, 147, 500], [159, 490, 227, 501], [249, 
492, 293, 499], [312, 492, 366, 499], [377, 492, 392, 500], [76, 503, 120, 510], [131, 503, 175, 512], [187, 502, 254, 513], [276, 504, 329, 513], [349, 505, 430, 512], [450, 503, 536, 514], [79, 516, 158, 523], [169, 517, 232, 524], [244, 516, 330, 526], [344, 517, 422, 526], [433, 519, 486, 527], [496, 519, 538, 529], [405, 492, 439, 500], [460, 491, 536, 502], [421, 834, 436, 841], [447, 835, 492, 842], [501, 835, 554, 844], [566, 835, 599, 844], [611, 835, 682, 845], [693, 836, 738, 843], [748, 837, 800, 845], [812, 837, 837, 844], [402, 846, 455, 854], [465, 847, 518, 854], [538, 847, 544, 854], [565, 847, 599, 856], [611, 847, 664, 855], [676, 848, 709, 857], [402, 859, 428, 866], [439, 859, 472, 868], [494, 859, 537, 866], [546, 859, 616, 867], [637, 860, 655, 867], [666, 860, 681, 867], [693, 860, 719, 868], [402, 869, 471, 879], [493, 871, 518, 878], [528, 871, 573, 878], [592, 872, 637, 879], [658, 870, 734, 881], [747, 872, 791, 880], [803, 872, 836, 880], [402, 882, 446, 890], [458, 881, 525, 892], [537, 885, 554, 890], [565, 883, 609, 891], [618, 883, 664, 891], [673, 884, 719, 891], [731, 882, 780, 892], [585, 894, 589, 903], [601, 895, 671, 903], [683, 896, 718, 905], [731, 894, 735, 904], [729, 849, 791, 858], [802, 849, 837, 856], [740, 859, 791, 869], [803, 861, 836, 869]]
Image pixels: (3, 224, 224)
From the above output we can see the predicted words. Let's concatenate all the words to see the entire text.
predicted_ocr = ""
for word in features['words'][0]:
    predicted_ocr += word + " "
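The same concatenation can be written more compactly with str.join, which also avoids the trailing space (shown on a stand-in list, since features is only available in the notebook session):

```python
words = ['Long', 'yia', '1968']   # stands in for features['words'][0]
predicted_ocr = " ".join(words)
print(predicted_ocr)              # -> 'Long yia 1968'
```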
Let's see the predicted OCR text.
predicted_ocr
'Long yia 1968 dispela Visiting Misin bilong Yunaitet Nesens i kamap long Niugini. Hia ol i stap long Rabaul long taim bilong ilek- sen. Nem bilong ol em hia (kirap long lephan): Mista A. V. Caine (Liberia), Mista J.M. McEwan (Nu Silan), Mista W. Allen (Amerika), na Mista P. Gaschignard (Frans). ( D.I.E.S.poto ) Trinde, Mas 3, 1971 ©1 memba bilong lain bilong Yunaitet Nesens i kam lukluk raun long Teritori long yia 1965. Em hia ol i stap long Kompian. 01 memba i sindaun i stap (ki- rap long lephan): Mista Dermot J. Andre Naudy (Frans), Dwight Dickinson (Amerika), Nathaniel Eastman (Liberia). (Yunaitet Nesens poto) Swan (Englan), 01 memba bilong 1971 Visiting Misin bilong Yu- naitet Nesens i bung insait long bus long Isten Hailans. Em ol hia lephan}: Sir Denis Allen (Englan), Mista Paul Blanc (Frans), na Mista Adnam Raouf (Irak) ( D.I.E.S. poto ) wanpela haus (kirap long '
From the above we can see the text extracted from the image.
Now let's plot the bounding boxes of this text. The boxes returned by the feature extractor are normalized to a 0-1000 range, so we scale them back to the image's pixel dimensions.
image = Image.open(images_path[0])
draw = ImageDraw.Draw(image)
width_scale = image.width/1000
height_scale = image.height/1000
for boundary_box in features['boxes'][0]:
    draw.rectangle([boundary_box[0] * width_scale, boundary_box[1] * height_scale,
                    boundary_box[2] * width_scale, boundary_box[3] * height_scale],
                   outline='red', width=2)
image
In the above output we can clearly see the boxes around the text.
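The per-coordinate scaling in the loop above can be factored into a small helper (an illustrative function, not part of transformers) that maps a 0-1000 box back to pixel coordinates:

```python
def unnormalize_box(box, width, height):
    """Map a LayoutLM-style 0-1000 box back to pixel coordinates."""
    return [
        box[0] * width / 1000,
        box[1] * height / 1000,
        box[2] * width / 1000,
        box[3] * height / 1000,
    ]

# e.g. the first word box on the 1417x1986-pixel page loaded earlier
print(unnormalize_box([74, 786, 108, 795], 1417, 1986))
```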
Now let's look at another way to extract the text from the same image.
PaddleOCR is an open-source library that provides practical, ultra-lightweight pre-trained models and supports training and deployment on server, mobile, embedded, and IoT devices.
PaddleOCR is mainly designed and trained for Chinese and English character recognition, but the models have also been verified on several other languages such as French, Korean, Japanese, and German.
As mentioned, it is very lightweight and can be used with or without a GPU. For each detected text line it returns three output components: the bounding box, the recognized text, and a confidence score.
!python3 -m pip install paddlepaddle-gpu==2.1.3.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
Collecting paddlepaddle-gpu==2.1.3.post112
Downloading https://paddle-wheel.bj.bcebos.com/2.1.3/linux/linux-gpu-cuda11.2-cudnn8-mkl-gcc8.2-avx/paddlepaddle_gpu-2.1.3.post112-cp38-cp38-linux_x86_64.whl (349.8 MB)
|████████████████████████████████| 349.8 MB 975 bytes/s
Requirement already satisfied: Pillow in /usr/local/lib/python3.8/dist-packages (from paddlepaddle-gpu==2.1.3.post112) (9.0.0)
Requirement already satisfied: decorator in /usr/local/lib/python3.8/dist-packages (from paddlepaddle-gpu==2.1.3.post112) (4.4.2)
Requirement already satisfied: gast<=0.4.0,>=0.3.3 in /usr/local/lib/python3.8/dist-packages (from paddlepaddle-gpu==2.1.3.post112) (0.4.0)
Requirement already satisfied: astor in /usr/local/lib/python3.8/dist-packages (from paddlepaddle-gpu==2.1.3.post112) (0.8.1)
Requirement already satisfied: numpy>=1.13 in /usr/local/lib/python3.8/dist-packages (from paddlepaddle-gpu==2.1.3.post112) (1.21.6)
Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from paddlepaddle-gpu==2.1.3.post112) (1.15.0)
Requirement already satisfied: protobuf>=3.1.0 in /usr/local/lib/python3.8/dist-packages (from paddlepaddle-gpu==2.1.3.post112) (3.19.6)
Requirement already satisfied: requests>=2.20.0 in /usr/local/lib/python3.8/dist-packages (from paddlepaddle-gpu==2.1.3.post112) (2.23.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests>=2.20.0->paddlepaddle-gpu==2.1.3.post112) (2022.9.24)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests>=2.20.0->paddlepaddle-gpu==2.1.3.post112) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests>=2.20.0->paddlepaddle-gpu==2.1.3.post112) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests>=2.20.0->paddlepaddle-gpu==2.1.3.post112) (3.0.4)
Installing collected packages: paddlepaddle-gpu
Successfully installed paddlepaddle-gpu-2.1.3.post112
!pip install "paddleocr>=2.0.1"
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting paddleocr>=2.0.1
Downloading paddleocr-2.6.1.1-py3-none-any.whl (411 kB)
|████████████████████████████████| 411 kB 14.9 MB/s
Collecting fonttools>=4.24.0
Downloading fonttools-4.38.0-py3-none-any.whl (965 kB)
|████████████████████████████████| 965 kB 71.4 MB/s
Requirement already satisfied: imgaug in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (0.4.0)
Collecting premailer
Downloading premailer-3.10.0-py2.py3-none-any.whl (19 kB)
Collecting python-docx
Downloading python-docx-0.8.11.tar.gz (5.6 MB)
|████████████████████████████████| 5.6 MB 62.4 MB/s
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (4.6.3)
Collecting pyclipper
Downloading pyclipper-1.3.0.post4-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (619 kB)
|████████████████████████████████| 619 kB 72.1 MB/s
Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (4.64.1)
Requirement already satisfied: lxml in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (4.9.1)
Collecting fire>=0.3.0
Downloading fire-0.4.0.tar.gz (87 kB)
|████████████████████████████████| 87 kB 5.8 MB/s
Collecting rapidfuzz
Downloading rapidfuzz-2.13.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
|████████████████████████████████| 2.2 MB 63.1 MB/s
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (1.21.6)
Collecting visualdl
Downloading visualdl-2.4.1-py3-none-any.whl (4.9 MB)
|████████████████████████████████| 4.9 MB 64.1 MB/s
Requirement already satisfied: scikit-image in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (0.18.3)
Requirement already satisfied: opencv-contrib-python in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (4.6.0.66)
Collecting PyMuPDF<1.21.0
Downloading PyMuPDF-1.20.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.8 MB)
|████████████████████████████████| 8.8 MB 25.8 MB/s
Requirement already satisfied: shapely in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (1.8.5.post1)
Requirement already satisfied: cython in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (0.29.32)
Collecting attrdict
Downloading attrdict-2.0.1-py2.py3-none-any.whl (9.9 kB)
Requirement already satisfied: opencv-python in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (4.6.0.66)
Requirement already satisfied: openpyxl in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (3.0.10)
Requirement already satisfied: lmdb in /usr/local/lib/python3.8/dist-packages (from paddleocr>=2.0.1) (0.99)
Collecting pdf2docx
Downloading pdf2docx-0.5.6-py3-none-any.whl (148 kB)
|████████████████████████████████| 148 kB 81.1 MB/s
Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from fire>=0.3.0->paddleocr>=2.0.1) (1.15.0)
Requirement already satisfied: termcolor in /usr/local/lib/python3.8/dist-packages (from fire>=0.3.0->paddleocr>=2.0.1) (2.1.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.8/dist-packages (from imgaug->paddleocr>=2.0.1) (3.2.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.8/dist-packages (from imgaug->paddleocr>=2.0.1) (1.7.3)
Requirement already satisfied: imageio in /usr/local/lib/python3.8/dist-packages (from imgaug->paddleocr>=2.0.1) (2.9.0)
Requirement already satisfied: Pillow in /usr/local/lib/python3.8/dist-packages (from imgaug->paddleocr>=2.0.1) (9.0.0)
Requirement already satisfied: tifffile>=2019.7.26 in /usr/local/lib/python3.8/dist-packages (from scikit-image->paddleocr>=2.0.1) (2022.10.10)
Requirement already satisfied: PyWavelets>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from scikit-image->paddleocr>=2.0.1) (1.4.1)
Requirement already satisfied: networkx>=2.0 in /usr/local/lib/python3.8/dist-packages (from scikit-image->paddleocr>=2.0.1) (2.8.8)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->imgaug->paddleocr>=2.0.1) (2.8.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->imgaug->paddleocr>=2.0.1) (3.0.9)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->imgaug->paddleocr>=2.0.1) (1.4.4)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.8/dist-packages (from matplotlib->imgaug->paddleocr>=2.0.1) (0.11.0)
Requirement already satisfied: et-xmlfile in /usr/local/lib/python3.8/dist-packages (from openpyxl->paddleocr>=2.0.1) (1.1.0)
Collecting cssutils
Downloading cssutils-2.6.0-py3-none-any.whl (399 kB)
|████████████████████████████████| 399 kB 64.0 MB/s
Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from premailer->paddleocr>=2.0.1) (2.23.0)
Requirement already satisfied: cachetools in /usr/local/lib/python3.8/dist-packages (from premailer->paddleocr>=2.0.1) (5.2.0)
Collecting cssselect
Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->premailer->paddleocr>=2.0.1) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->premailer->paddleocr>=2.0.1) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->premailer->paddleocr>=2.0.1) (2022.9.24)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->premailer->paddleocr>=2.0.1) (3.0.4)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from visualdl->paddleocr>=2.0.1) (21.3)
Requirement already satisfied: flask>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from visualdl->paddleocr>=2.0.1) (1.1.4)
Requirement already satisfied: protobuf>=3.11.0 in /usr/local/lib/python3.8/dist-packages (from visualdl->paddleocr>=2.0.1) (3.19.6)
Collecting Flask-Babel>=1.0.0
Downloading Flask_Babel-2.0.0-py3-none-any.whl (9.3 kB)
Collecting bce-python-sdk
Downloading bce_python_sdk-0.8.74-py3-none-any.whl (204 kB)
|████████████████████████████████| 204 kB 79.9 MB/s
Requirement already satisfied: pandas in /usr/local/lib/python3.8/dist-packages (from visualdl->paddleocr>=2.0.1) (1.3.5)
Collecting multiprocess
Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
|████████████████████████████████| 132 kB 81.8 MB/s
Requirement already satisfied: Jinja2<3.0,>=2.10.1 in /usr/local/lib/python3.8/dist-packages (from flask>=1.1.1->visualdl->paddleocr>=2.0.1) (2.11.3)
Requirement already satisfied: click<8.0,>=5.1 in /usr/local/lib/python3.8/dist-packages (from flask>=1.1.1->visualdl->paddleocr>=2.0.1) (7.1.2)
Requirement already satisfied: Werkzeug<2.0,>=0.15 in /usr/local/lib/python3.8/dist-packages (from flask>=1.1.1->visualdl->paddleocr>=2.0.1) (1.0.1)
Requirement already satisfied: itsdangerous<2.0,>=0.24 in /usr/local/lib/python3.8/dist-packages (from flask>=1.1.1->visualdl->paddleocr>=2.0.1) (1.1.0)
Requirement already satisfied: Babel>=2.3 in /usr/local/lib/python3.8/dist-packages (from Flask-Babel>=1.0.0->visualdl->paddleocr>=2.0.1) (2.11.0)
Requirement already satisfied: pytz in /usr/local/lib/python3.8/dist-packages (from Flask-Babel>=1.0.0->visualdl->paddleocr>=2.0.1) (2022.6)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.8/dist-packages (from Jinja2<3.0,>=2.10.1->flask>=1.1.1->visualdl->paddleocr>=2.0.1) (2.0.1)
Requirement already satisfied: future>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from bce-python-sdk->visualdl->paddleocr>=2.0.1) (0.16.0)
Collecting pycryptodome>=3.8.0
Downloading pycryptodome-3.16.0-cp35-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.3 MB)
|████████████████████████████████| 2.3 MB 56.9 MB/s
Requirement already satisfied: dill>=0.3.6 in /usr/local/lib/python3.8/dist-packages (from multiprocess->visualdl->paddleocr>=2.0.1) (0.3.6)
Building wheels for collected packages: fire, python-docx
Building wheel for fire (setup.py) ... done
Created wheel for fire: filename=fire-0.4.0-py2.py3-none-any.whl size=115943 sha256=d85423d86738210785546d0379f6a93993bdba3d194acfe1260c74074c9578e9
Stored in directory: /root/.cache/pip/wheels/1f/10/06/2a990ee4d73a8479fe2922445e8a876d38cfbfed052284c6a1
Building wheel for python-docx (setup.py) ... done
Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size=184505 sha256=52a856de91114d255581d87c237270734c0e4fe36b4eddf47f1e3d962ebcc7a8
Stored in directory: /root/.cache/pip/wheels/32/b8/b2/c4c2b95765e615fe139b0b17b5ea7c0e1b6519b0a9ec8fb34d
Successfully built fire python-docx
Installing collected packages: pycryptodome, python-docx, PyMuPDF, multiprocess, fonttools, Flask-Babel, fire, cssutils, cssselect, bce-python-sdk, visualdl, rapidfuzz, pyclipper, premailer, pdf2docx, attrdict, paddleocr
Successfully installed Flask-Babel-2.0.0 PyMuPDF-1.20.2 attrdict-2.0.1 bce-python-sdk-0.8.74 cssselect-1.2.0 cssutils-2.6.0 fire-0.4.0 fonttools-4.38.0 multiprocess-0.70.14 paddleocr-2.6.1.1 pdf2docx-0.5.6 premailer-3.10.0 pyclipper-1.3.0.post4 pycryptodome-3.16.0 python-docx-0.8.11 rapidfuzz-2.13.3 visualdl-2.4.1
Once all the requirements are installed, let's create the PaddleOCR API object.
from paddleocr import PaddleOCR, draw_ocr
ocr = PaddleOCR(use_angle_cls=True, lang='en')
[2022/12/07 23:53:15] ppocr DEBUG: Namespace(alpha=1.0, benchmark=False, beta=1.0, cls_batch_num=6, cls_image_shape='3, 48, 192', cls_model_dir='/root/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer', cls_thresh=0.9, cpu_threads=10, crop_res_save_dir='./output', det=True, det_algorithm='DB', det_box_type='quad', det_db_box_thresh=0.6, det_db_score_mode='fast', det_db_thresh=0.3, det_db_unclip_ratio=1.5, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_east_score_thresh=0.8, det_limit_side_len=960, det_limit_type='max', det_model_dir='/root/.paddleocr/whl/det/en/en_PP-OCRv3_det_infer', det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, det_pse_thresh=0, det_sast_nms_thresh=0.2, det_sast_score_thresh=0.5, draw_img_save_dir='./inference_results', drop_score=0.5, e2e_algorithm='PGNet', e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_limit_side_len=768, e2e_limit_type='max', e2e_model_dir=None, e2e_pgnet_mode='fast', e2e_pgnet_score_thresh=0.5, e2e_pgnet_valid_set='totaltext', enable_mkldnn=False, fourier_degree=5, gpu_mem=500, help='==SUPPRESS==', image_dir=None, image_orientation=False, ir_optim=True, kie_algorithm='LayoutXLM', label_list=['0', '180'], lang='en', layout=True, layout_dict_path=None, layout_model_dir=None, layout_nms_threshold=0.5, layout_score_threshold=0.5, max_batch_size=10, max_text_length=25, merge_no_span_structure=True, min_subgraph_size=15, mode='structure', ocr=True, ocr_order_method=None, ocr_version='PP-OCRv3', output='./output', page_num=0, precision='fp32', process_id=0, re_model_dir=None, rec=True, rec_algorithm='SVTR_LCNet', rec_batch_num=6, rec_char_dict_path='/usr/local/lib/python3.8/dist-packages/paddleocr/ppocr/utils/en_dict.txt', rec_image_inverse=True, rec_image_shape='3, 48, 320', rec_model_dir='/root/.paddleocr/whl/rec/en/en_PP-OCRv3_rec_infer', recovery=False, save_crop_res=False, save_log_path='./log_output/', scales=[8, 16, 32], ser_dict_path='../train_data/XFUND/class_list_xfun.txt', 
ser_model_dir=None, show_log=True, sr_batch_num=1, sr_image_shape='3, 32, 128', sr_model_dir=None, structure_version='PP-StructureV2', table=True, table_algorithm='TableAttn', table_char_dict_path=None, table_max_len=488, table_model_dir=None, total_process_num=1, type='ocr', use_angle_cls=True, use_dilation=False, use_gpu=True, use_mp=False, use_npu=False, use_onnx=False, use_pdf2docx_api=False, use_pdserving=False, use_space_char=True, use_tensorrt=False, use_visual_backbone=True, use_xpu=False, vis_font_path='./doc/fonts/simfang.ttf', warmup=False)
Since Google Colab doesn't support cv2's image display directly, we import cv2_imshow from google.colab.patches instead.
Also, as mentioned, my files are stored in Google Drive, so I need to mount it again.
from google.colab.patches import cv2_imshow
import cv2
from pathlib import Path
from PIL import Image, ImageDraw
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
ls "/content/drive/MyDrive/IUB Fall 22/Advance NLP/Project/data"
others/ resume_images/ wantok_images/
file_path = "/content/drive/MyDrive/IUB Fall 22/Advance NLP/Project/data"
images_path = list(Path(file_path + "/wantok_images").glob("*"))
image = Image.open(images_path[0])
print(str(images_path[0]))
/content/drive/MyDrive/IUB Fall 22/Advance NLP/Project/data/wantok_images/Wantok_namba_15_page-0001.jpg
With Google Drive mounted and the file paths collected, let's load one image and view it.
img_path = str(images_path[0])
img = cv2.imread(img_path, cv2.IMREAD_UNCHANGED)
cv2_imshow(img)
We can view the image clearly. Now let's extract the text from this image using the PaddleOCR object created above.
# %timeit result = ocr.ocr(img_path, cls=True)
result = ocr.ocr(img_path, cls=True)
[2022/12/07 23:53:25] ppocr DEBUG: dt_boxes num : 37, elapse : 5.290798187255859 [2022/12/07 23:53:25] ppocr DEBUG: cls num : 37, elapse : 0.14812207221984863 [2022/12/07 23:53:25] ppocr DEBUG: rec_res num : 37, elapse : 0.18934869766235352
Now we have the result from the OCR. Let's print each line of the result.
for line in result[0]:
    print(line)
[[[63.0, 130.0], [641.0, 130.0], [641.0, 267.0], [63.0, 267.0]], ('WANTUK', 0.8914305567741394)]
[[[41.0, 289.0], [144.0, 294.0], [143.0, 317.0], [40.0, 312.0]], ('NAMBA 15', 0.9962493181228638)]
[[[295.0, 287.0], [540.0, 290.0], [540.0, 312.0], [295.0, 310.0]], ('Trinde, Mas 3, 1971', 0.9390543699264526)]
[[[686.0, 292.0], [806.0, 294.0], [805.0, 317.0], [685.0, 314.0]], ('Prais 10c', 0.9112671613693237)]
[[[838.0, 323.0], [1230.0, 327.0], [1229.0, 385.0], [837.0, 380.0]], ('TUNAITET NESENS', 0.8078736066818237)]
[[[129.0, 898.0], [770.0, 904.0], [769.0, 931.0], [129.0, 925.0]], ('Ol memba bilong lain bilong Yunaitet Nesens i kam', 0.9855824708938599)]
[[[103.0, 921.0], [768.0, 925.0], [767.0, 954.0], [103.0, 950.0]], ('lukluk raun long Teritori long yia i965. Em hia ol', 0.9799408912658691)]
[[[103.0, 945.0], [768.0, 950.0], [767.0, 979.0], [103.0, 974.0]], ('i stap long Kompian. Ol memba i sindaun i stap (ki-', 0.9804458022117615)]
[[[103.0, 970.0], [635.0, 972.0], [635.0, 999.0], [103.0, 997.0]], ('rap long lephan): Mista Dermot J. Swan', 0.9864988923072815)]
[[[639.0, 974.0], [764.0, 977.0], [763.0, 999.0], [639.0, 997.0]], ('(Englan),', 0.9901745319366455)]
[[[105.0, 995.0], [369.0, 995.0], [369.0, 1022.0], [105.0, 1022.0]], ('Andre Naudy (Frans),', 0.9781529307365417)]
[[[380.0, 999.0], [479.0, 999.0], [479.0, 1020.0], [380.0, 1020.0]], ('Dwight', 0.9990332126617432)]
[[[472.0, 999.0], [612.0, 999.0], [612.0, 1020.0], [472.0, 1020.0]], ('Dickinson', 0.9994150996208191)]
[[[635.0, 997.0], [764.0, 1002.0], [763.0, 1024.0], [634.0, 1020.0]], ('(Amerika),', 0.9789139032363892)]
[[[106.0, 1018.0], [768.0, 1024.0], [767.0, 1053.0], [105.0, 1047.0]], ('Nathaniel Eastman (Liberia). (Yunaitet Nesens poto)', 0.9933131337165833)]
[[[99.0, 1554.0], [382.0, 1558.0], [381.0, 1585.0], [99.0, 1580.0]], ('Long yia 1968 dispela', 0.9927085041999817)]
[[[386.0, 1558.0], [506.0, 1560.0], [506.0, 1583.0], [386.0, 1580.0]], ('Visiting', 0.9984325170516968)]
[[[74.0, 1580.0], [386.0, 1583.0], [386.0, 1605.0], [74.0, 1603.0]], ('Misin bilong Yunaitet', 0.9978932738304138)]
[[[382.0, 1585.0], [506.0, 1585.0], [506.0, 1605.0], [382.0, 1605.0]], (' Nesens i', 0.8945876359939575)]
[[[72.0, 1605.0], [506.0, 1607.0], [506.0, 1630.0], [72.0, 1628.0]], ('kamap long Niugini. Hia ol i stap', 0.960042417049408)]
[[[76.0, 1626.0], [506.0, 1630.0], [506.0, 1657.0], [76.0, 1653.0]], ('long Rabaul long taim bilong ilek-', 0.9953529238700867)]
[[[80.0, 1657.0], [133.0, 1657.0], [133.0, 1676.0], [80.0, 1676.0]], ('sen.', 0.9954148530960083)]
[[[589.0, 1649.0], [1189.0, 1655.0], [1189.0, 1684.0], [588.0, 1678.0]], ('Ol memba bilong I97l Visiting Misin biIong Yu-', 0.9729275703430176)]
[[[78.0, 1678.0], [253.0, 1678.0], [253.0, 1698.0], [78.0, 1698.0]], ('long lephan):', 0.983543336391449)]
[[[258.0, 1675.0], [346.0, 1680.0], [345.0, 1701.0], [257.0, 1696.0]], ('Mista', 0.9976465106010437)]
[[[348.0, 1676.0], [504.0, 1678.0], [504.0, 1701.0], [348.0, 1698.0]], ('A. V. Caine', 0.9549927115440369)]
[[[563.0, 1676.0], [1191.0, 1682.0], [1191.0, 1709.0], [563.0, 1702.0]], ('naitet Nesens i bung insait long wanpela haus', 0.993631899356842)]
[[[76.0, 1696.0], [506.0, 1698.0], [506.0, 1725.0], [76.0, 1723.0]], ('(Liberia), Mista J.M. McEwan (Nu', 0.9585366249084473)]
[[[566.0, 1700.0], [686.0, 1705.0], [685.0, 1728.0], [565.0, 1723.0]], ('bus iong', 0.93471360206604)]
[[[675.0, 1703.0], [1027.0, 1705.0], [1027.0, 1727.0], [675.0, 1725.0]], (' Isten Hailans. Em ol hia', 0.9801034331321716)]
[[[1040.0, 1705.0], [1189.0, 1707.0], [1189.0, 1730.0], [1039.0, 1727.0]], ('(kirap long', 0.9977756142616272)]
[[[74.0, 1719.0], [504.0, 1723.0], [504.0, 1752.0], [74.0, 1748.0]], ('Silan), Mista W. Allen (Amerika),', 0.9812512397766113)]
[[[567.0, 1727.0], [677.0, 1727.0], [677.0, 1748.0], [567.0, 1748.0]], ('lephan):', 0.9847099184989929)]
[[[70.0, 1744.0], [491.0, 1748.0], [491.0, 1775.0], [69.0, 1771.0]], ('na Mista P. Gaschignard (Frans).', 0.9989610910415649)]
[[[563.0, 1746.0], [1111.0, 1750.0], [1111.0, 1779.0], [563.0, 1775.0]], ('Blanc (Frans), na Mista Adnam Raouf (Irak)', 0.9828816652297974)]
[[[203.0, 1771.0], [413.0, 1775.0], [413.0, 1796.0], [202.0, 1791.0]], ('( D.I.E.S.poto )', 0.9483118057250977)]
[[[818.0, 1771.0], [1050.0, 1775.0], [1050.0, 1802.0], [818.0, 1798.0]], ('( D.I.E.S. poto )', 0.8332781195640564)]
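Each line in the PaddleOCR result pairs a quadrilateral (four corner points) with a (text, score) tuple. As a small sketch of working with that structure, here is a helper (my own, not part of PaddleOCR) that collapses the quadrilateral into an axis-aligned bounding box:

```python
def quad_to_bbox(quad):
    # Collapse a 4-point quadrilateral [[x, y], ...] into (x_min, y_min, x_max, y_max)
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return (min(xs), min(ys), max(xs), max(ys))

# One line copied from the result above: the quadrilateral, then (text, score)
line = [[[41.0, 289.0], [144.0, 294.0], [143.0, 317.0], [40.0, 312.0]],
        ('NAMBA 15', 0.9962493181228638)]
quad, (text, score) = line
print(quad_to_bbox(quad), text, round(score, 3))  # (40.0, 289.0, 144.0, 317.0) NAMBA 15 0.996
```

An axis-aligned box like this is handy when a downstream tool (a cropper, a layout model) expects rectangles rather than quadrilaterals.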
Now let's look at the entire concatenated text.
predicted_text = ""
for line in result[0]:
    print(line[1][0])
    predicted_text += line[1][0]
with open("ocr_predicted.txt", "w") as file:
    file.write(predicted_text)
print(predicted_text)
WANTUK NAMBA 15 Trinde, Mas 3, 1971 Prais 10c TUNAITET NESENS Ol memba bilong lain bilong Yunaitet Nesens i kam lukluk raun long Teritori long yia i965. Em hia ol i stap long Kompian. Ol memba i sindaun i stap (ki- rap long lephan): Mista Dermot J. Swan (Englan), Andre Naudy (Frans), Dwight Dickinson (Amerika), Nathaniel Eastman (Liberia). (Yunaitet Nesens poto) Long yia 1968 dispela Visiting Misin bilong Yunaitet Nesens i kamap long Niugini. Hia ol i stap long Rabaul long taim bilong ilek- sen. Ol memba bilong I97l Visiting Misin biIong Yu- long lephan): Mista A. V. Caine naitet Nesens i bung insait long wanpela haus (Liberia), Mista J.M. McEwan (Nu bus iong Isten Hailans. Em ol hia (kirap long Silan), Mista W. Allen (Amerika), lephan): na Mista P. Gaschignard (Frans). Blanc (Frans), na Mista Adnam Raouf (Irak) ( D.I.E.S.poto ) ( D.I.E.S. poto ) WANTUKNAMBA 15Trinde, Mas 3, 1971Prais 10cTUNAITET NESENSOl memba bilong lain bilong Yunaitet Nesens i kamlukluk raun long Teritori long yia i965. Em hia oli stap long Kompian. Ol memba i sindaun i stap (ki-rap long lephan): Mista Dermot J. Swan(Englan),Andre Naudy (Frans),DwightDickinson(Amerika),Nathaniel Eastman (Liberia). (Yunaitet Nesens poto)Long yia 1968 dispelaVisitingMisin bilong Yunaitet Nesens ikamap long Niugini. Hia ol i staplong Rabaul long taim bilong ilek-sen.Ol memba bilong I97l Visiting Misin biIong Yu-long lephan):MistaA. V. Cainenaitet Nesens i bung insait long wanpela haus(Liberia), Mista J.M. McEwan (Nubus iong Isten Hailans. Em ol hia(kirap longSilan), Mista W. Allen (Amerika),lephan):na Mista P. Gaschignard (Frans).Blanc (Frans), na Mista Adnam Raouf (Irak)( D.I.E.S.poto )( D.I.E.S. poto )
From the above output cell we can see the extracted text. Note that the lines were concatenated without separators, which is why some words run together at the end of the output.
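A minimal alternative to the bare string concatenation, sketched here on a hypothetical two-line result, is to join the recognized strings with a newline:

```python
# Hypothetical sample in the same shape as PaddleOCR's result[0]
lines = [
    [[[63.0, 130.0], [641.0, 130.0], [641.0, 267.0], [63.0, 267.0]], ('WANTUK', 0.89)],
    [[[41.0, 289.0], [144.0, 294.0], [143.0, 317.0], [40.0, 312.0]], ('NAMBA 15', 0.99)],
]
predicted_text = "\n".join(line[1][0] for line in lines)
print(predicted_text)
```

Joining with "\n" (or a space) keeps adjacent words readable in the saved text file.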
Now let's draw the bounding boxes around the text.
# draw result
from PIL import Image

image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result[0]]
txts = [line[1][0] for line in result[0]]
scores = [line[1][1] for line in result[0]]
print(scores)
print(len(boxes))
im_show = draw_ocr(image, boxes, txts, scores,
                   font_path='/usr/share/fonts/truetype/humor-sans/Humor-Sans.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
[0.8914305567741394, 0.9962493181228638, 0.9390543699264526, 0.9112671613693237, 0.8078736066818237, 0.9855824708938599, 0.9799408912658691, 0.9804458022117615, 0.9864988923072815, 0.9901745319366455, 0.9781529307365417, 0.9990332126617432, 0.9994150996208191, 0.9789139032363892, 0.9933131337165833, 0.9927085041999817, 0.9984325170516968, 0.9978932738304138, 0.8945876359939575, 0.960042417049408, 0.9953529238700867, 0.9954148530960083, 0.9729275703430176, 0.983543336391449, 0.9976465106010437, 0.9549927115440369, 0.993631899356842, 0.9585366249084473, 0.93471360206604, 0.9801034331321716, 0.9977756142616272, 0.9812512397766113, 0.9847099184989929, 0.9989610910415649, 0.9828816652297974, 0.9483118057250977, 0.8332781195640564]
37
Let's view the output image.
img = cv2.imread('/content/result.jpg', cv2.IMREAD_UNCHANGED)
cv2_imshow(img)
From the above image we can clearly see the extracted text and its bounding boxes. In addition, each word comes with a confidence score that indicates how reliable the recognition is.
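Since every detection carries a confidence score, one practical follow-up is to drop low-confidence lines before further processing. A sketch, using a hypothetical threshold of 0.9 and a small hypothetical sample:

```python
def filter_by_score(lines, threshold=0.9):
    # Keep only OCR lines whose recognition confidence meets the threshold
    return [line for line in lines if line[1][1] >= threshold]

# Hypothetical sample in the PaddleOCR result[0] shape
sample = [
    [[[838.0, 323.0], [1230.0, 327.0], [1229.0, 385.0], [837.0, 380.0]], ('TUNAITET NESENS', 0.81)],
    [[[41.0, 289.0], [144.0, 294.0], [143.0, 317.0], [40.0, 312.0]], ('NAMBA 15', 0.996)],
]
kept = filter_by_score(sample)
print([line[1][0] for line in kept])  # ['NAMBA 15']
```

The right threshold depends on the documents; the headline 'TUNAITET NESENS' above scored only 0.81, so an aggressive threshold can also discard correct text.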
The next few cells are benchmarks. I ran PaddleOCR on my resume to check how accurate the results are.
img_path = '/content/resume.png'
img = cv2.imread(img_path, cv2.IMREAD_UNCHANGED)
cv2_imshow(img)
# %timeit result = ocr.ocr(img_path, cls=True)
result = ocr.ocr(img_path, cls=True)
[2022/12/08 00:00:34] ppocr DEBUG: dt_boxes num : 40, elapse : 0.09183430671691895
[2022/12/08 00:00:34] ppocr DEBUG: cls num : 40, elapse : 0.07754826545715332
[2022/12/08 00:00:34] ppocr DEBUG: rec_res num : 40, elapse : 0.3118305206298828
for line in result[0]:
    print(line)
[[[614.0, 213.0], [1088.0, 213.0], [1088.0, 252.0], [614.0, 252.0]], ('Umesh Kumar Gattem', 0.9807722568511963)]
[[[370.0, 277.0], [1333.0, 277.0], [1333.0, 309.0], [370.0, 309.0]], ('AI Engineer | Graduate student at Indiana University Bloomington', 0.9907920360565186)]
[[[245.0, 323.0], [1455.0, 323.0], [1455.0, 353.0], [245.0, 353.0]], ('Blomington, IN, 47408 |Mobile : +1 (812)-272-5756 |EMail : umesh.gattem@gmail.com', 0.9905869364738464)]
[[[143.0, 367.0], [1589.0, 367.0], [1589.0, 399.0], [143.0, 399.0]], ('Portfolio : https://umesh-gattem.github.iol | Linkedin: https://www.linkedin.com/in/umesh-kumar-8b9142a8/', 0.9889769554138184)]
[[[102.0, 424.0], [323.0, 424.0], [323.0, 456.0], [102.0, 456.0]], ('EDUCATION :', 0.9960023760795593)]
[[[99.0, 470.0], [596.0, 470.0], [596.0, 502.0], [99.0, 502.0]], ('Indiana University of Bloomington', 0.9906994104385376)]
[[[1280.0, 470.0], [1587.0, 470.0], [1587.0, 502.0], [1280.0, 502.0]], ('Aug 2021 - May 2023', 0.9983083009719849)]
[[[99.0, 513.0], [610.0, 513.0], [610.0, 545.0], [99.0, 545.0]], ("Master 's in Data Science | (GPA : 3.5)", 0.9688986539840698)]
[[[99.0, 555.0], [1118.0, 559.0], [1118.0, 591.0], [99.0, 587.0]], ('Anil Neerukonda Institute of Technology and Sciences , Visakhapatnam', 0.9901943206787109)]
[[[1270.0, 555.0], [1589.0, 552.0], [1589.0, 591.0], [1271.0, 594.0]], ('Aug 2012 - April 2016', 0.9949930906295776)]
[[[99.0, 603.0], [1007.0, 603.0], [1007.0, 635.0], [99.0, 635.0]], ("Bachelor 's in Computer Science and Engineering | (GPA : 8.05/10)", 0.9746212363243103)]
[[[102.0, 690.0], [430.0, 690.0], [430.0, 720.0], [102.0, 720.0]], ('TECHNICAL SKILLS', 0.999392569065094)]
[[[97.0, 731.0], [755.0, 729.0], [755.0, 768.0], [97.0, 770.0]], ('Languages - Python, C, C++, Java, HTML, CSS', 0.9921110272407532)]
[[[99.0, 775.0], [1458.0, 779.0], [1457.0, 811.0], [99.0, 807.0]], ('AI Courses - Neural Networks, Deep Learning, Machine Learning, Data Visualizations, Data Mining', 0.9950522184371948)]
[[[99.0, 823.0], [1464.0, 823.0], [1464.0, 855.0], [99.0, 855.0]], ('Frameworks - TensorFlow, PyTorch, Keras, Sci-kit, Pandas, Numpy, Flask Python, FastAPI, Uvicorn', 0.9903169870376587)]
[[[102.0, 869.0], [1254.0, 869.0], [1254.0, 901.0], [102.0, 901.0]], ('Visualization tools - Matplotlib, Seaborn, Plotly, Word Cloud, Geopy, Folium, Bokeh.', 0.9906477928161621)]
[[[99.0, 912.0], [744.0, 912.0], [744.0, 944.0], [99.0, 944.0]], ('Databases - Mysql, Postgres, MongoDB, Sqlite', 0.9962514638900757)]
[[[102.0, 956.0], [1206.0, 956.0], [1206.0, 988.0], [102.0, 988.0]], ('Project Management and Tools - Git, JIRA, Confluence, Slack, Pycharm, IntelliJ', 0.998393714427948)]
[[[102.0, 1043.0], [584.0, 1043.0], [584.0, 1072.0], [102.0, 1072.0]], ('PROFESSIONAL EXPERIENCE', 0.9987578392028809)]
[[[99.0, 1086.0], [1056.0, 1086.0], [1056.0, 1118.0], [99.0, 1118.0]], ('RAZORTHINK TECHNOLOGIES, Bangalore, India - AI Engineer', 0.984175443649292)]
[[[1282.0, 1086.0], [1580.0, 1086.0], [1580.0, 1118.0], [1282.0, 1118.0]], ('July 2016 - July 2021', 0.9995238780975342)]
[[[122.0, 1132.0], [1582.0, 1132.0], [1582.0, 1164.0], [122.0, 1164.0]], (': Part of the Product team called Razorthink AI Platform which lets the data scientists and analysts visually', 0.9818212389945984)]
[[[173.0, 1176.0], [1125.0, 1176.0], [1125.0, 1208.0], [173.0, 1208.0]], ('build data transformation recipes, DL models and end-to-end pipelines.', 0.9923768043518066)]
[[[166.0, 1217.0], [1488.0, 1221.0], [1487.0, 1254.0], [166.0, 1249.0]], ('Worked on different modelling libraries like Data Mining, Data Visualizations, Transfer Learning.', 0.9865124821662903)]
[[[173.0, 1263.0], [1319.0, 1265.0], [1319.0, 1297.0], [173.0, 1295.0]], ('Training and Inferring models ,Tensorboard, TFHUB models, Distributed Training, :', 0.991465151309967)]
[[[169.0, 1309.0], [1474.0, 1309.0], [1474.0, 1341.0], [169.0, 1341.0]], ('Worked on RZTDL Library which is a patented deep learning framework built on top of different', 0.9956429600715637)]
[[[173.0, 1352.0], [1554.0, 1352.0], [1554.0, 1384.0], [173.0, 1384.0]], ('backends like Tensorflow and Pytorch which supports different operations like CNN, RNN, LSTM etc', 0.9910761713981628)]
[[[164.0, 1398.0], [1511.0, 1398.0], [1511.0, 1430.0], [164.0, 1430.0]], ('Implemented different Python SDK(Software Development Kit) Libraries as part of the AI platform', 0.9935489892959595)]
[[[102.0, 1483.0], [296.0, 1483.0], [296.0, 1512.0], [102.0, 1512.0]], ('PROJECTS :', 0.9993253946304321)]
[[[102.0, 1529.0], [926.0, 1529.0], [926.0, 1561.0], [102.0, 1561.0]], ('IPL Dataset Visualization - Github | Slides |Final Report', 0.9742147326469421)]
[[[1298.0, 1526.0], [1578.0, 1526.0], [1578.0, 1556.0], [1298.0, 1556.0]], ('Oct 2021 - Dec 2021', 0.9925193786621094)]
[[[162.0, 1570.0], [1534.0, 1574.0], [1534.0, 1606.0], [162.0, 1602.0]], ('Worked on IPL Dataset (2008-2020) and visualized different scenarios like Season Stats, Player Stats,', 0.9838525652885437)]
[[[173.0, 1618.0], [1439.0, 1618.0], [1439.0, 1650.0], [173.0, 1650.0]], ('Team Stats, Venue Stats for each season or among all seasons using various visualization tools.', 0.9894545674324036)]
[[[99.0, 1700.0], [901.0, 1703.0], [901.0, 1735.0], [99.0, 1732.0]], ('RESEARCH AND DEVELOPMENT WORKS : Github', 0.9899471998214722)]
[[[164.0, 1749.0], [1554.0, 1749.0], [1554.0, 1778.0], [164.0, 1778.0]], ('Worked on research papers like Style based Generative Adversarial Networks (Style GAN), Variationa', 0.9793515801429749)]
[[[173.0, 1794.0], [1561.0, 1794.0], [1561.0, 1826.0], [173.0, 1826.0]], ('Autoencoders (VAE), Causal Bayesian Networks, Self-Normalizing Neural Networks(SELU), Adaptive', 0.992547869682312)]
[[[171.0, 1836.0], [899.0, 1838.0], [898.0, 1870.0], [171.0, 1868.0]], ('Structural Learning of Artificial Neural Networks, etc.', 0.9810876250267029)]
[[[129.0, 1881.0], [1453.0, 1879.0], [1453.0, 1911.0], [129.0, 1914.0]], (".Implemented models for different types of GAN's like DCGAN, Wasserstein GAN, Style GAN.", 0.9830083847045898)]
# draw result
from PIL import Image

image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result[0]]
txts = [line[1][0] for line in result[0]]
scores = [line[1][1] for line in result[0]]
print(scores)
print(len(boxes))
im_show = draw_ocr(image, boxes, txts, scores,
                   font_path='/usr/share/fonts/truetype/humor-sans/Humor-Sans.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
[0.9807722568511963, 0.9907920360565186, 0.9905869364738464, 0.9889769554138184, 0.9960023760795593, 0.9906994104385376, 0.9983083009719849, 0.9688986539840698, 0.9901943206787109, 0.9949930906295776, 0.9746212363243103, 0.999392569065094, 0.9921110272407532, 0.9950522184371948, 0.9903169870376587, 0.9906477928161621, 0.9962514638900757, 0.998393714427948, 0.9987578392028809, 0.984175443649292, 0.9995238780975342, 0.9818212389945984, 0.9923768043518066, 0.9865124821662903, 0.991465151309967, 0.9956429600715637, 0.9910761713981628, 0.9935489892959595, 0.9993253946304321, 0.9742147326469421, 0.9925193786621094, 0.9838525652885437, 0.9894545674324036, 0.9899471998214722, 0.9793515801429749, 0.992547869682312, 0.9810876250267029, 0.9830083847045898]
38
img = cv2.imread('/content/result.jpg', cv2.IMREAD_UNCHANGED)
cv2_imshow(img)
From the above output we can see that PaddleOCR extracted the text from the resume very accurately.
Let's look at another way of extracting text: Tesseract, one of the most popular OCR engines, which many libraries use internally.
Tesseract is an Optical Character Recognition engine available for various operating systems.
To get access to this engine, we first need to install it on our machine, using the command that matches our system environment. On Ubuntu/Debian:
sudo apt-get install tesseract-ocr
For macOS users, we’ll be using Homebrew to install Tesseract
brew install tesseract
If you just want to update Tesseract without updating any other Homebrew components, use the following command.
HOMEBREW_NO_AUTO_UPDATE=1 brew install tesseract
Once we run the appropriate command for our environment, the Tesseract engine will be available on our machine. We also need the Python library "pytesseract" to run the Tesseract model, which we already installed earlier.
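As a quick sanity check before running any OCR, we can verify that the tesseract binary is actually on the PATH using only the standard library (this helper is my own, not part of pytesseract):

```python
import shutil

def tesseract_available():
    # shutil.which returns the full path to the binary, or None if it is not installed
    return shutil.which("tesseract")

print(tesseract_available())  # e.g. /usr/bin/tesseract, or None when missing
```

If this prints None, pytesseract calls will fail until the engine itself is installed.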
Tesseract combines OCR with AI to capture data from structured and unstructured documents. It extracts text from images and documents that lack a text layer, and can output the result as a new searchable text file, PDF, or most other popular formats.
Like PaddleOCR, Tesseract is lightweight and can run with or without a GPU. Tesseract can also handle images with blurred or overly bright backgrounds.
!pip3 install pytesseract
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/ Requirement already satisfied: pytesseract in /usr/local/lib/python3.8/dist-packages (0.3.10) Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.8/dist-packages (from pytesseract) (21.3) Requirement already satisfied: Pillow>=8.0.0 in /usr/local/lib/python3.8/dist-packages (from pytesseract) (9.0.0) Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging>=21.3->pytesseract) (3.0.9)
Now import all the necessary modules.
from PIL import Image, ImageDraw
import pytesseract
import cv2
from google.colab.patches import cv2_imshow
import os
from pathlib import Path
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
ls "/content/drive/MyDrive/IUB Fall 22/Advance NLP/Project/data"
others/ resume_images/ wantok_images/
file_path = "/content/drive/MyDrive/IUB Fall 22/Advance NLP/Project/data"
images_path = list(Path(file_path + "/wantok_images").glob("*"))
image = Image.open(images_path[0])
Once we load the data, we can directly run the pytesseract API on the image.
preprocess = None
image = cv2.imread(str(images_path[0]))
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# check to see if we should apply thresholding to preprocess the image
if preprocess == "thresh":
    gray = cv2.threshold(gray, 0, 255,
                         cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# make a check to see if median blurring should be done to remove noise
elif preprocess == "blur":
    gray = cv2.medianBlur(gray, 3)

# write the grayscale image to disk as a temporary file so we can apply OCR to it
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# load the image as a PIL/Pillow image, apply OCR, and then delete the temporary file
predicted_text = pytesseract.image_to_string(Image.open(filename))
os.remove(filename)
print(predicted_text)
Long yia 1968 dispela Visiting
Misin bilong Yunaitet Nesens i
kamap long Niugini. Hia ol i stap
long Rabaul long taim bilong ilek-
sen. Nem bilong ol em hia (kirap
long lephan): Mista A. V. Caine
(Liberia), Mista J.M. McEwan (Nu
Silan), Mista W. Allen (Amerika),
na Mista P. Gaschignard (Frans).
( D.I.E.S.poto )
Trinde, Mas 3, 1971
©1 memba bilong lain bilong Yunaitet Nesens i kam
lukluk raun long Teritori long yia 1965. Em hia ol
i stap long Kompian. 01 memba i sindaun i stap (ki-
rap long lephan): Mista Dermot J.
Andre Naudy (Frans), Dwight Dickinson (Amerika),
Nathaniel Eastman (Liberia). (Yunaitet Nesens poto)
Swan (Englan),
01 memba bilong 1971 Visiting Misin bilong Yu-
naitet Nesens i bung insait long
bus long Isten Hailans. Em ol hia
lephan}: Sir Denis Allen (Englan), Mista Paul
Blanc (Frans), na Mista Adnam Raouf (Irak)
( D.I.E.S. poto )
wanpela haus
(kirap long
From the above output cell, we can see the extracted text.
I have extracted the text from the Wantok images in several different ways. I also worked on one more method of text extraction using the Google Cloud API; I have added that code to the GitHub repository.
You can also find Python-script versions of the above methods on GitHub.
I had also planned two more things for this project: evaluation metrics and fine-tuning the Layout Language Model. However, I could not get labels for the current dataset, so I cannot do that yet. I plan to generate ground-truth labels, run an evaluation, and fine-tune the model to get better results.
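Once ground-truth labels exist, even a simple character-level similarity gives a first evaluation metric for comparing the OCR engines. A sketch using Python's standard-library difflib, on hypothetical strings:

```python
import difflib

def char_similarity(predicted, reference):
    # Ratio of matching characters between OCR output and ground truth (1.0 = identical)
    return difflib.SequenceMatcher(None, predicted, reference).ratio()

reference = "Long yia 1965. Em hia ol"
print(char_similarity("Long yia i965. Em hia ol", reference))  # close to, but below, 1.0
print(char_similarity(reference, reference))                   # 1.0
```

This captures errors like the "i965" vs "1965" confusion seen in the outputs above; a proper evaluation would use character error rate (CER) and word error rate (WER), but the idea is the same.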